PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization
Abstract
Big data analytics is a critical and unavoidable process in any business and industrial environment. Nowadays, companies that exploit the inner value of big data earn more revenue than those that do not. Once a company has settled on its big data strategy, it faces another serious problem: designing and building in-house a scalable system that runs its business intelligence is difficult. The PROTEUS project aims to design, develop, and provide an open, ready-to-use big data software architecture that can handle extremely large historical data and data streams and that supports online machine learning for predictive analytics and real-time interactive visualization. The overall evaluation of PROTEUS is carried out using a real industrial scenario.

©2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017, Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. PROJECT DESCRIPTION

PROTEUS (https://www.proteus-bigdata.com/) is a research project funded by EU Horizon 2020 (https://ec.europa.eu/programmes/horizon2020/). Its goal is to investigate and develop ready-to-use, scalable online machine learning algorithms and real-time interactive visual analytics, with attention to scalability, usability, and effectiveness. In particular, PROTEUS aims to solve the following big data challenges by surpassing current state-of-the-art technologies with original contributions:

1. Handling extremely large historical data and data streams
2. Analytics on massive, high-rate, and complex data streams
3. Real-time interactive visual analytics of massive datasets, continuous unbounded streams, and learned models

PROTEUS’s solutions for the challenges above are: 1) a real-time hybrid processing system built on top of Apache Flink (https://flink.apache.org/), formerly Stratosphere (http://stratosphere.eu/) [1], with optimized support for relational algebra and linear algebra operations through the LARA declarative language [2, 3]; 2) a new library for scalable online machine learning and data mining called SOLMA; and 3) the investigation and development of incremental visual methods that allow end users to efficiently explore both batch and streaming data for making well-informed decisions in real time.

These three subsystems will be integrated in a single platform running in a containerized environment. Once the platform is deployed in a cluster, its life cycle is as follows: 1) the end user writes data analytics tasks in LARA, mixing extract-transform-load steps with pipelines of SOLMA algorithms, and executes them on top of the PROTEUS hybrid processing system; 2) the system continuously trains the deployed machine learning models in an online fashion; 3) the visual stack queries those models and displays the requested real-time predictions and statistics to the end user. PROTEUS faces an additional challenge: integrating machine learning solutions correctly into big data processing systems, taking into account the principal anti-patterns and risk factors that affect this kind of interaction [4].
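The train-then-query part of the life cycle above can be illustrated with a minimal sketch of online learning. It is not SOLMA or LARA code: the class `OnlineLinearRegressor`, the helper `synthetic_stream`, and the learning rate are hypothetical names and values chosen for illustration, and plain stochastic gradient descent stands in for whatever algorithms the project actually provides. The model is updated one record at a time and can be queried for a prediction at any point, which is the role the visual stack plays in the description.

```python
from typing import Iterable, List, Sequence, Tuple


class OnlineLinearRegressor:
    """Hypothetical linear model updated one example at a time with SGD."""

    def __init__(self, n_features: int, learning_rate: float = 0.05) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias: float = 0.0
        self.learning_rate = learning_rate
        self.seen = 0  # number of stream records consumed so far

    def predict(self, x: Sequence[float]) -> float:
        """Answer a query using the current state of the model."""
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: Sequence[float], y: float) -> float:
        """Take one gradient step on the squared error of a single record."""
        error = self.predict(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
        self.seen += 1
        return error


def synthetic_stream(n: int) -> Iterable[Tuple[List[float], float]]:
    """Stand-in for a data stream: noiseless records following y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield [x], 2.0 * x + 1.0


# Continuous training: each arriving record updates the model in place.
model = OnlineLinearRegressor(n_features=1)
for features, target in synthetic_stream(5000):
    model.update(features, target)

# A consumer (for example, a dashboard) can query the model at any time.
print(f"records seen: {model.seen}, prediction at x=0.5: {model.predict([0.5]):.3f}")
```

Because the model never stores past records, memory use stays constant regardless of stream length, which is the property that makes this style of learning a fit for unbounded streams.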
In addition, PROTEUS ensures the achievement of its goals through rigorous experimental testing and industrially validated processes. The project is guided by the specific requirements of the hot strip mill steel-making process, provided by an industrial partner of the PROTEUS consortium. A hot strip mill produces coils whose quality is affected by several parameters (e.g., temperature, vibration intensity, tension in the rollers). Since coils are used in further production stages, they must be free of defects. Predicting anomalies by analyzing the massive real-time data generated during the hot strip mill process is the main target in this validation scenario. Regardless of the above validation scenario, the PROTEUS platform is also applicable to general data stream analysis in other domains.

Acknowledgements. This work was supported by the EU Horizon 2020 project PROTEUS (687691).
Similar resources
P-V-L Deep: A Big Data Analytics Solution for Now-casting in Monetary Policy
The development of new technologies has confronted the entire domain of science and industry with issues of big data's scalability as well as its integration with the purpose of forecasting analytics in its life cycle. In predictive analytics, the forecast of near-future and recent past - or in other words, the now-casting - is the continuous study of real-time events and constantly updated whe...
Real-time Scheduling of a Flexible Manufacturing System using a Two-phase Machine Learning Algorithm
The static and analytic scheduling approach is very difficult to follow and is not always applicable in real-time. Most of the scheduling algorithms are designed to be established in offline environment. However, we are challenged with three characteristics in real cases: First, problem data of jobs are not known in advance. Second, most of the shop’s parameters tend to be stochastic. Third, th...
Predictive Visual Analytics – Approaches for Movie Ratings and Discussion of Open Research Challenges
We present two original approaches for visual-interactive prediction of user movie ratings and box office gross after the opening weekend, as designed and awarded during VAST Challenge 2013. Our approaches are driven by machine learning models and interactive data exploration, respectively. They consider an array of different training data types, including categorical/discrete data, time series...
Visual Search Analytics: Combining Machine Learning and Interactive Visualization to Support Human-Centred Search
Searching within large online document collections has become a common activity in our modern information-centric society. While simple fact verification tasks are well supported by current search technologies, when the search tasks become more complex, a substantial cognitive burden is placed on the searcher to craft and refine their queries, evaluate and explore among the search results, and ...
PIVE: Per-Iteration Visualization Environment for Real-Time Interactions with Dimension Reduction and Clustering
One of the key advantages of visual analytics is its capability to leverage both humans’ visual perception and the power of computing. A big obstacle in integrating machine learning with visual analytics is its high computing cost. To tackle this problem, this paper presents PIVE (Per-Iteration Visualization Environment) that supports real-time interactive visualization with machine learning. ...
Publication date: 2017